Predicting Bank Customer Churn

Predictive Analytics

Overview

The goal of this project is to predict customer churn at a bank using machine learning techniques. The project covers feature engineering, model specification, training, and evaluation to identify the best-performing model for predicting churn.

Show the code
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(pROC)
library(MLmetrics)
library(fastDummies)
bank = read_rds("/Users/Shared/Data 505/BankChurners.rds")
Show the code
# Engineer a quadratic age term and keep a small three-predictor feature set
banko <- bank %>%
  mutate(age2 = Customer_Age^2) %>%
  select(Customer_Age, age2, Dependent_count, Churn)

# Re-read the data, recode Churn as logical, and dummy-encode the categorical variables
bank = read_rds("/Users/Shared/Data 505/BankChurners.rds") %>%
  mutate(Churn = Churn == "yes") %>%
  dummy_cols(remove_selected_columns = TRUE)

# Principal component analysis on the centered and scaled predictors
pr_bank = prcomp(select(bank, -Churn), scale = TRUE, center = TRUE)

screeplot(pr_bank, type = "lines")
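The scree plot gives a visual elbow; the cumulative proportion of variance explained can also be read off numerically from the standard deviations that prcomp stores (a quick check, not part of the original output):

# Cumulative proportion of variance explained by each component
round(cumsum(pr_bank$sdev^2) / sum(pr_bank$sdev^2), 3)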

Show the code
# Keep Churn plus the first four principal components; the names below are
# interpretive labels for the PCs, not the original columns
prc <- bind_cols(select(bank, Churn), as.data.frame(pr_bank$x)) %>%
  select(1:5) %>%
  rename("Gender" = PC1, "Card_Category" = PC2, "Income_Category" = PC3, "Credit_Limit" = PC4)

head(prc)
# A tibble: 6 × 5
  Churn Gender Card_Category Income_Category Credit_Limit
  <lgl>  <dbl>         <dbl>           <dbl>        <dbl>
1 FALSE  1.50          2.38            1.21         0.897
2 FALSE -1.36         -0.653           1.52         1.46 
3 FALSE  0.943         2.25            2.38         2.29 
4 FALSE -2.50         -0.208           2.35         1.39 
5 FALSE  0.841         2.14            3.82         0.559
6 FALSE -0.115         2.22            0.918        0.721
Show the code
# 3-fold cross-validation, keeping class probabilities for ROC-based tuning
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(504)

# 80/20 train/test split, stratified on Churn
bank_index <- createDataPartition(banko$Churn, p = 0.80, list = FALSE)
train <- banko[bank_index, ]
test <- banko[-bank_index, ]

# Train Random Forest model
fit <- train(Churn ~ .,
             data = train,
             method = "rf",
             ntree = 20,
             tuneLength = 3,
             metric = "ROC",
             trControl = ctrl)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
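The truncation note appears because tuneLength = 3 requests three mtry values, but with only three predictors caret's default grid contains just two distinct candidates. One way to avoid the note, sketched here as an alternative rather than the original specification, is to pass an explicit grid:

# Sketch: enumerate the mtry candidates explicitly instead of using tuneLength
fit <- train(Churn ~ .,
             data = train,
             method = "rf",
             ntree = 20,
             tuneGrid = expand.grid(mtry = 1:3),
             metric = "ROC",
             trControl = ctrl)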
Show the code
fit
Random Forest 

8102 samples
   3 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 5401, 5402, 5401 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec        
  2     0.4945632  0.9995588  0.0000000000
  3     0.4953395  0.9988237  0.0007680492

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
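A cross-validated ROC near 0.495 is already a warning sign that the three features carry little signal. One quick diagnostic, not part of the original pipeline, is caret's scaled variable importance for the tuned forest:

# Scaled variable importance for the fitted random forest
varImp(fit)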
Show the code
confusionMatrix(predict(fit, test), factor(test$Churn))
Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  1700  325
       yes    0    0
                                          
               Accuracy : 0.8395          
                 95% CI : (0.8228, 0.8552)
    No Information Rate : 0.8395          
    P-Value [Acc > NIR] : 0.5148          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.0000          
         Pos Pred Value : 0.8395          
         Neg Pred Value :    NaN          
             Prevalence : 0.8395          
         Detection Rate : 0.8395          
   Detection Prevalence : 1.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : no              
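The confusion matrix shows the model never predicts 'yes': with roughly 16% churners and three weak predictors, the forest defaults to the majority class, which is why accuracy exactly matches the no-information rate. One common mitigation, sketched below using caret's built-in resampling option (a sketch, not the original specification), is to up-sample the minority class within each cross-validation fold:

# Sketch: up-sample the minority class inside each CV fold
ctrl_up <- trainControl(method = "cv", number = 3, classProbs = TRUE,
                        summaryFunction = twoClassSummary, sampling = "up")

fit_up <- train(Churn ~ .,
                data = train,
                method = "rf",
                ntree = 20,
                tuneLength = 3,
                metric = "ROC",
                trControl = ctrl_up)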
                                          
Show the code
print(fit$bestTune)
  mtry
2    3
Show the code
set.seed(1504)

# Fresh 80/20 split for the final fit
bank_index <- createDataPartition(banko$Churn, p = 0.80, list = FALSE)
train <- banko[bank_index, ]
test <- banko[-bank_index, ]

# Re-fit the model using the best mtry from the tuning run
fit_final <- train(Churn ~ .,
                   data = train,
                   method = "rf",
                   tuneGrid = fit$bestTune,
                   metric = "ROC",
                   trControl = ctrl)

# ROC curve on the held-out set, using the predicted probability of the "yes" class (column 2)
myRoc <- roc(test$Churn, predict(fit_final, test, type = "prob")[, 2])

plot(myRoc)

Show the code
auc(myRoc)
Area under the curve: 0.4861
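An AUC below 0.5 means the model's ranking of churners is slightly worse than chance, consistent with the weak feature set. Note also that the principal-component scores assembled in prc were never passed to a model; the forest above saw only the age and dependent-count features. A minimal sketch of fitting the same specification on those components instead (assuming Churn is first recoded to factor levels caret accepts):

# Sketch: train the same random forest on the first four principal components
prc_data <- prc %>%
  mutate(Churn = factor(ifelse(Churn, "yes", "no"), levels = c("no", "yes")))

prc_index <- createDataPartition(prc_data$Churn, p = 0.80, list = FALSE)
fit_pca <- train(Churn ~ .,
                 data = prc_data[prc_index, ],
                 method = "rf",
                 ntree = 20,
                 metric = "ROC",
                 trControl = ctrl)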

Conclusion:

This project walked through the full workflow for predicting bank customer churn: feature engineering, dummy encoding, PCA, and cross-validated Random Forest training. The results themselves are cautionary. Trained on only the age and dependent-count features, the model never predicted a churner (specificity of 0, Kappa of 0), its accuracy merely matched the no-information rate, and the held-out AUC of roughly 0.49 was no better than chance. The principal components were constructed but never supplied to the model, so dimensionality reduction did not contribute to predictive power here.

Future Work:

The most immediate improvements are to train on a richer feature set (the full dummy-encoded predictors or the principal components, rather than age and dependent count alone) and to address the class imbalance, for example with the up-sampling sketch shown above. Beyond that, exploring other algorithms, feature selection techniques, and more thorough hyperparameter tuning, along with more granular customer data and external factors, could further improve prediction accuracy.